WOMBAT 2025 Tutorial

Visualising Uncertainty

Harriet Mason, Dianne Cook

Department of Econometrics and Business Statistics

Welcome 👋🏼

Thanks for joining us today to learn about making data plots.

🦘 Di is a Professor of Statistics. She has more than 30 years of experience in research and teaching of data visualisation, and in open source software development.
🐨 Harriet is a final year PhD student, working on better representation of uncertainty in data visualisations, particularly focused on spatial data.

We are both in Econometrics and Business Statistics, at Monash University.

🧩 Feel free to ask questions any time. 🤔


🎯 The objectives for today are:

Caution: Incorporating uncertainty into plots is far from a settled state of best practice. This is our best attempt to summarise the current literature, the approaches we find valuable, and the available tools.

Session 1
Foundations of uncertainty in data visualisations

Introduction

What is uncertainty?



You don’t know what you don’t know

  • Statistical (aleatory) uncertainty
    • notion of randomness
    • variability in the outcome/measurements
  • Systemic (epistemic) uncertainty
    • due to bias, misunderstanding, assumptions
    • measurement error
    • handling of missing information, and pre-processing choices
    • model choices
    • incorrect comparisons
    • can you think of any others?

Mostly we are concerned about representing statistical uncertainty.
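This distinction can be illustrated with a small simulation (a sketch, not part of the tutorial materials): collecting more data shrinks aleatory uncertainty, but not epistemic uncertainty.

```r
# Sketch: simulated measurements of a known true value
set.seed(2025)
true_value <- 10
bias <- 0.5                   # epistemic: an unknown instrument bias
noise <- rnorm(1000, 0, 1)    # aleatory: random measurement noise
measurements <- true_value + bias + noise

mean(measurements)                       # the estimate
sd(measurements) / sqrt(length(noise))   # the standard error shrinks as n grows,
                                         # but the bias of 0.5 never does
```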

Common measures and representations

Show the data (1/4)

The most valuable way to show uncertainty is to show all the data.

The plot shows the first preference vote percentage for the Greens in the 2019 Australian Federal Election, for the 150 electorates.

Our plot of choice is the jittered dotplot, where points are spread vertically according to density.

Code
library(tidyverse)
library(ggbeeswarm)

election <- read_csv(here::here("session1/data/election2019.csv"),
  skip = 1,
  col_types = cols(
    .default = col_character(),
    OrdinaryVotes = col_double(),
    AbsentVotes = col_double(),
    ProvisionalVotes = col_double(),
    PrePollVotes = col_double(),
    PostalVotes = col_double(),
    TotalVotes = col_double(),
    Swing = col_double()
  )
)
e_grn <- election |>
  group_by(DivisionID) |>
  summarise(
    DivisionNm = unique(DivisionNm),
    State = unique(StateAb),
    votes_GRN = TotalVotes[which(PartyAb == "GRN")],
    votes_total = sum(TotalVotes)
  ) |>
  mutate(perc_GRN = votes_GRN / votes_total * 100)

e_grn |>
  mutate(State = fct_reorder(State, perc_GRN)) |>
  ggplot(aes(x=perc_GRN, y=State)) +
    geom_quasirandom(groupOnX = FALSE, varwidth = TRUE) +
    labs(
      x = "First preference votes %",
      y = ""
    ) +
  xlim(c(0,50))

Show the data (2/4)

What do we learn?

  • Different number of observations in each state
  • One outlier in Vic
  • As a group, ACT has higher %’s
  • Vic has a small cluster of points with higher %’s
  • %’s are mostly very low

This plot ONLY shows uncertainty!

Show the data (3/4)

What would be other common ways to display this data?

  • Side-by-side boxplots
  • Side-by-side violin
  • On a map of electorates

For each plot think about

  • what is uncertainty, and what is estimate
  • what the plot shows or hides

Code
e_grn |>
  mutate(State = fct_reorder(State, perc_GRN)) |>
  ggplot(aes(x=perc_GRN, y=State)) +
    geom_boxplot(varwidth = TRUE) +
    labs(
      x = "First preference votes %",
      y = ""
    ) +
  xlim(c(0,50))

Code
e_grn |>
  mutate(State = fct_reorder(State, perc_GRN)) |>
  ggplot(aes(x=perc_GRN, y=State)) +
    geom_violin(draw_quantiles = c(0.25, 0.5, 0.75),
      fill="#006dae", alpha=0.5) +
    labs(
      x = "First preference votes %",
      y = ""
    ) +
  xlim(c(0,50))

Code
library(ggthemes)   # for theme_map()
oz_states <- ozmaps::ozmap_states |> filter(NAME != "Other Territories")
oz_votes <- rmapshaper::ms_simplify(ozmaps::abs_ced)
oz_votes_grn <- full_join(oz_votes, e_grn, by=c("NAME"="DivisionNm"))

ggplot(oz_votes_grn, aes(fill=perc_GRN)) +
  geom_sf(colour="white") +
  scale_fill_viridis_c(direction=-1, trans = "log", 
    guide = "colourbar", 
    labels = scales::label_number(accuracy = 0.1)) +
  theme_map() +
  theme(legend.position = "right", 
    legend.title = element_blank())

Show the data (4/4)

Even when you think you are showing the data, what you show is often an estimate together with some representation of uncertainty.



The election data is actually a set of estimates. The electorates are strata, so what was shown was a percentage computed within each stratum.

What is the full data? What other strata are possible?

Generally, we trust the values provided by the AEC, and we explore the distribution of votes across the different strata of the electorate structure. The goal is to understand the variability in the way people have voted, to identify electorates where the winner might flip next time, …
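One alternative stratification can be sketched as follows, assuming the `election` tibble read in earlier (the vote-type column names are taken from that chunk's `col_types` specification): compute the Greens percentage within each vote type rather than each electorate.

```r
# Sketch: re-stratify by vote type instead of by electorate
library(tidyverse)

e_grn_type <- election |>
  pivot_longer(
    cols = c(OrdinaryVotes, AbsentVotes, ProvisionalVotes,
             PrePollVotes, PostalVotes),
    names_to = "vote_type", values_to = "votes"
  ) |>
  group_by(vote_type) |>
  summarise(
    votes_GRN   = sum(votes[PartyAb == "GRN"]),
    votes_total = sum(votes)
  ) |>
  mutate(perc_GRN = votes_GRN / votes_total * 100)
e_grn_type
```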












It’s really difficult to concretely define uncertainty!

Terminology

Names for main thing:

  • estimate
  • statistic
  • signal

Names for uncertainty, needed to understand main thing:

  • variation
  • variability
  • variance/standard deviation
  • error/standard error
  • IQR/MAD
  • noise

Displaying uncertainty can be described as signal suppression.
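The measures listed above can be computed directly; a sketch, assuming the `e_grn` tibble from the earlier election example:

```r
# Sketch: the "main thing" and several of the listed uncertainty measures
library(dplyr)

e_grn |>
  summarise(
    est_mean   = mean(perc_GRN),       # estimate
    est_median = median(perc_GRN),     # robust estimate
    sd  = sd(perc_GRN),                # variability of the data
    se  = sd(perc_GRN) / sqrt(n()),    # variability of the estimate
    iqr = IQR(perc_GRN),               # robust spread
    mad = mad(perc_GRN)                # robust spread
  )
```

Note the distinction the comments draw: the standard deviation describes the data, while the standard error describes the estimate.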

Example: distributions

Code
library(tidyverse)
library(ggbeeswarm)  # geom_quasirandom
library(patchwork)   # plot_layout

load("data/melbtemp.rda")
melbtemp_2019 <- melbtemp |>
  filter(year == 2019)
  
d1 <- ggplot(melbtemp_2019, aes(x=month, y=temp)) +
  geom_quasirandom() + 
  stat_summary(geom="point", fun="median", 
    colour="red", size=3) +
  xlab("") + ylab("Temp (C)") +
  ggtitle("A. ggbeeswarm::geom_quasirandom")
  
library(ggforce)
d2 <- ggplot(melbtemp_2019, aes(x=month, y=temp)) +
  geom_violin(fill = "#6F7C4D", colour=NA, alpha=0.7) +
  geom_sina() +
  xlab("") + ylab("Temp (C)") +
  ggtitle("B. geom_violin + ggforce::geom_sina")

library(ggridges)
d3 <- ggplot(melbtemp_2019, aes(x=temp, y=month)) +
  geom_density_ridges(scale = 1.5, 
                      quantile_lines = TRUE,
                      quantiles = 2,
                      fill = "#6F7C4D") +
  xlab("Temp (C)") + ylab("") + 
  theme_ridges() +
  ggtitle("C. ggridges::geom_density_ridges")

library(ggdist)
d4 <- ggplot(melbtemp_2019, aes(x=temp, y=month)) +
  stat_halfeye(fill="#6F7C4D", alpha=0.7) +
  geom_point(pch = "|", size = 2,
    position = position_nudge(y = -.15)) +
  xlab("Temp (C)") + ylab("") +
  ggtitle("D. ggdist::stat_halfeye")

# patchwork design: A over B on the left, C and D as full-height columns
lout <- "
AACD
BBCD
"
d1 + d2 + d3 + d4 + plot_layout(design=lout)

  • What is the main element?
  • How is the uncertainty displayed?
  • What is the uncertainty pattern? And then, what would be an appropriate representation?
  • What are the key features of the data that we need to preserve in a plot?
  • Why the different aspect ratios?

Your turn

  1. Decide on the appropriate information about the variability that needs to be included in the display.
  2. Play with different options for your choice of display. Aim to have three different designs.
  3. Is there a winner, or several roughly equally good displays?
⏱️ 10 minutes

How this affects perception

Why “signal suppression”? (1/4)

Regression

Why “signal suppression”? (2/4)

Longitudinal data

Why “signal suppression”? (3/4)

Forecasts

Why “signal suppression”? (4/4)

Map

Why can’t we just treat uncertainty as another variable?

  • Uncertainty is not just another variable…
  • It presents an interesting perceptual problem
  • The estimate and its uncertainty are two sides of the same coin
  • Usually we do not want variables to interfere with each other
  • In uncertainty visualisation, the opposite is true
  • This is the core of the signal suppression approach we implement
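A minimal sketch of signal suppression in the regression case, using simulated data rather than the slides' example: the fitted line alone asserts a trend; adding the data and the confidence band lets the reader judge how much of it is noise.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = runif(30))
df$y <- 0.2 * df$x + rnorm(30, sd = 1)   # weak signal, lots of noise

# Estimate only: the line asserts a signal
ggplot(df, aes(x, y)) +
  geom_smooth(method = "lm", se = FALSE)

# Estimate + data + confidence band: the wide band (consistent with a
# zero slope) suppresses the unjustified signal
ggplot(df, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm")
```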

Uncertainty representation in different types of problems

From the ggdibbler notes: the conditional line plot. All the examples are from time series packages, because those were the ones we found that return a distributional forecast that could be used in an example.

Common ways we express uncertainty in R, and where you would see them out in the wild:

  • Distributional: vectorised distributions that are wrappers on the d, p, q, r functions; e.g. as output from fable and some Bayesian packages (e.g. posterior)
  • Base R: the d, p, q, r functions
  • Confidence intervals: the hi/lo variables in the forecast packages
  • Bootstrapping (?)
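A sketch of the first of these, using the distributional package (the vectorised distribution type that fable returns):

```r
library(distributional)

d <- dist_normal(mu = c(0, 5), sigma = c(1, 2))  # a length-2 vector of distributions
mean(d)             # point estimates
quantile(d, 0.975)  # vectorised q-function
hilo(d, 95)         # 95% intervals, analogous to the hi/lo forecast variables
```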

Deciding which is the best design

Evaluation criteria: uncertainty visualisation should

  1. Reinforce justified signals
  2. Hide signals that are just noise

This is similar in logic to a hypothesis test (but not as binary, and without a set thing you are checking for).
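One way to make the hypothesis-test analogy concrete, sketched here assuming the `e_grn` tibble from the first example (normal-approximation intervals, purely illustrative):

```r
# Sketch: flag states whose 95% interval for the mean Greens %
# excludes the overall mean -- a "justified signal" to reinforce;
# the rest would be hidden as noise
library(dplyr)

overall <- mean(e_grn$perc_GRN)

e_grn |>
  group_by(State) |>
  summarise(m = mean(perc_GRN), se = sd(perc_GRN) / sqrt(n())) |>
  mutate(
    lo = m - 1.96 * se,
    hi = m + 1.96 * se,
    justified_signal = overall < lo | overall > hi
  )
```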

Accessibility considerations

  • Colour
    • A 2D palette is harder to read, because colour perception is not a simple 3D space. Mapping a variable to saturation hurts accessibility.
  • Size may also present perceptual difficulties.

Miscellaneous

  • bag plot
  • model only (regression, wages example) vs adding data
  • ggdist
  • tidyindex
  • animation over space

Where to learn more

End of session 1

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.